Giulia Carella
SDSC20 Workshop - Integrating CARTOframes into Spatial Data Science workflows. October 21 2020
Site selection refers to the process of deciding where to open a new store/facility, or relocate an existing one, by comparing the merits of potential locations. If some attribute of interest is known (e.g. the revenues of existing stores), statistical and machine learning methods can be used to support the decision process.
In this demo, we will look at where Starbucks should open new coffee shops in Long Island, NY. Specifically, we will build a model for the revenues of the existing stores as a function of sociodemographic data and use this model to predict the expected revenue in each block group.
First of all, install virtualenv to create an isolated environment inside your project folder and avoid conflicts with your other packages:
pip install virtualenv
After installing it, run these commands one by one inside your root project directory:
virtualenv venv
source venv/bin/activate
Your project environment is now isolated and you can install the required packages inside it:
pip install -r requirements.txt
To exit the virtual environment type:
deactivate
from requirements import *
from ppca import PPCA
from utils import *
Go to your CARTO dashboard and click on API KEY to get your key.
carto_username = 'XXXXXX'
carto_API = 'XXXXXX'
set_default_credentials(username=carto_username, api_key=carto_API)
First we load the Starbucks data, stored locally in the starbucks_long_island.csv file. The data contains the addresses of Starbucks stores, their annual revenue ($) and some store characteristics (e.g. the number of open hours per week).
stores = pd.read_csv('./data/starbucks_long_island.csv')
stores.head(3)
Next, we will plot the distribution of the annual revenues by store.
pandas_hist(stores, 'revenue',bins = 20, textsize = 15, title = 'Annual Revenue ($) by Starbucks store')
Since we only know the addresses, we first need to geocode them to obtain the corresponding geographic coordinates, which will be used to assign each store the Census data of its block group.
geocoder = Geocoding()
stores = geocoder.geocode(stores, street='addresslines').data
stores.crs = {'init' :'epsg:4326'}
to_carto(stores,'starbucks_long_island_geocoded', if_exists = 'replace')
bk = [i*10**5 for i in [4, 7, 10, 13, 16]]
Map(
Layer('starbucks_long_island_geocoded',
style = size_bins_style('revenue',
breaks = bk,
size_range = [1, 30],
color = '#00eaff',
stroke_color = '#006bff'),
legends = size_bins_legend(title='Annual Revenue ($)',
description='STARBUCKS',
footer =''),
),show_info = True,
viewport={'zoom': 10, 'lat': 40.646891, 'lng': -73.869378},
size=(920,300)
)
Next, we will download from the CARTO Data Observatory (www.carto.com/data) the demographic and socioeconomic variables that we will use to build a model for the stores' revenues. We will use data from the American Community Survey (ACS), an ongoing survey by the U.S. Census Bureau. The most granular resolution at which the ACS is available is the census block group level, with the most recent geography boundaries available for 2015. The data that we will use are from the survey run from 2013 to 2017.
As we are interested only in the data for the Long Island area, we first load the geographical boundaries of this area from the locally stored manhattan_long_island.geojson file.
aoi = gpd.read_file('./data/manhattan_long_island.geojson')
Map(
Layer(aoi,
style = basic_style(color = '#eaff00',
stroke_color = '#eaff00',
stroke_width = 2,
opacity = 0.2
),
),show_info = True,
viewport={'zoom': 7.5, 'lat': 40.845419, 'lng': -72.841376},
size=(920,300)
)
More documentation here: https://carto.com/developers/cartoframes/guides/Data-discovery/
Let's start by exploring the Data Catalog to see what data categories CARTO can offer for data in the US.
Catalog().country('usa').categories
Then check the available providers and geometries and select the most recent ACS data at the block group level.
Catalog().country('usa').category('demographics').datasets.to_dataframe().provider_id.unique()
catalog_usa_demo = Catalog().country('usa').category('demographics').datasets.to_dataframe()
catalog_usa_demo_acs = catalog_usa_demo[catalog_usa_demo.provider_id == 'usa_acs']
catalog_usa_demo_acs.geography_name.unique()
catalog_usa_demo_acs_bk = catalog_usa_demo_acs[catalog_usa_demo_acs.geography_name == 'Census Block Group - United States of America']
catalog_usa_demo_acs_bk.id.to_list()
And explore the metadata in greater detail:
Dataset.get('carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017').to_dict()
Geography.get('carto-do-public-data.carto.geography_usa_blockgroup_2015').to_dict()
Finally, get the data and geometries:
df_X = Dataset.get('acs_sociodemogr_b758e778')
q = "select * from $dataset$ where ST_Intersects(geom,st_geogfromtext('{}'))".format(aoi.geometry.astype(str)[0])
df_X = df_X.to_dataframe(sql_query=q)
df_X['geometry'] = df_X['geom'].apply(lambda x: str_to_geom(x))
df_X = GeoDataFrame(df_X,geometry = df_X.geometry)
df_X.crs = {'init' :'epsg:4326'}
df_X.head()
Map(
Layer(df_X,
geom_col = 'geometry',
encode_data=False,
style = basic_style(color = '#eaff00',
stroke_width = 1,
opacity = 0.2
),
),show_info = True,
viewport={'zoom': 10, 'lat': 40.646891, 'lng': -73.869378},
size=(920,300)
)
When comparing data across irregular units like census block groups, extra care is needed for extensive variables, i.e. those whose value for a block group can be viewed as a sum of sub-block values, as in the case of population. Extensive variables first need to be normalized, e.g. by dividing by the total area or the total population, depending on the application. Using the metadata available in the CARTO Data Observatory, we will check which variables can be considered extensive by looking at their default aggregation method: if the aggregation method is classified as SUM, then the variable is normalized by dividing by the total population.
## Get the default aggregation methods
agg_methods_table = read_carto("""
SELECT column_name,
agg_method
FROM variables_public
WHERE dataset_id = 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017'""",
credentials = Credentials(username='do-metadata', api_key='default_public'))
agg_methods_table.dropna(inplace = True)
agg_methods_table.head()
## Compute densities with respect to total population
for i in agg_methods_table.column_name:
if((agg_methods_table[agg_methods_table.column_name==i].agg_method.iloc[0] == 'SUM') & (i!='total_pop')):
df_X[i + '_dens'] = df_X[i].div(df_X['total_pop'])
df_X.drop(i, axis = 1, inplace = True)
## Drop unused columns
drop_cols = df_X.loc[:, df_X.dtypes != np.float64].columns
drop_cols = [i for i in df_X.columns if ('do_' in i) or (i in drop_cols and i not in ['geoid','geometry'])] + ['total_pop']
df_X.drop(drop_cols, axis = 1, inplace = True)
df_X_cols = list(df_X.drop(['geoid','geometry'], axis = 1).columns)
Missing data are common in surveys such as the ACS. Before using the ACS data as covariates, we therefore need to check for missing data and decide how to impute the missing values.
For each variable, the figure below shows the data points that are missing (corresponding to the rows in white). We notice that the amount of missing observations is typically small but varies from variable to variable.
fig, ax = plt.subplots(figsize=(60,10), nrows =1, ncols = 1)
msno.matrix(df_X[df_X_cols], labels = True, ax = ax)
Next, we will check the pair-wise correlations between the ACS variables. Correlated covariates are problematic for inference in linear (and generalized linear) models, as they inflate the variance of the regression coefficients (https://en.wikipedia.org/wiki/Variance_inflation_factor), but they do not necessarily represent a problem for prediction. However, if correlated variables are present, we can take advantage of these correlations to reduce the dimensionality of the model and save computation time.
Looking at the pair-wise correlations of a random sample of 50 variables, we notice large correlations.
draw_corrm(df_X[random.sample(df_X_cols, 50)])
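To quantify this multicollinearity beyond pair-wise correlations, one can also compute the variance inflation factor mentioned above. Below is a minimal self-contained sketch on synthetic data (the `vif` helper and the toy columns are illustrative, not part of the workshop code or the ACS variables): VIF for column $i$ is $1/(1 - R_i^2)$, where $R_i^2$ comes from regressing column $i$ on all the others.

```python
import numpy as np
import pandas as pd

def vif(df):
    """Variance inflation factor for each column of a numeric DataFrame."""
    X = df.to_numpy(dtype=float)
    out = {}
    for i, col in enumerate(df.columns):
        y = X[:, i]
        # Regress column i on the remaining columns (with an intercept)
        others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

# Two nearly collinear columns inflate each other's VIF;
# an independent column stays close to 1.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({'a': a,
                     'b': a + rng.normal(scale=0.1, size=500),  # ~ equal to a
                     'c': rng.normal(size=500)})                # independent
print(vif(demo))
```

A common rule of thumb flags VIF values above 5 or 10 as problematic for inference.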
To impute the missing data and reduce the model dimensionality we will use a probabilistic formulation of Principal Component Analysis (PCA). PCA transforms a set of high-dimensional data vectors into a lower-dimensional linear representation such that the original data can be reconstructed from the compressed representation with minimum square error.
In its probabilistic formulation, probabilistic PCA (PPCA), the data are described by a generative latent variable model in which the principal components are treated as unobserved (latent) variables. This formulation extends naturally to data sets with missing values.
More information can be found at http://www.jmlr.org/papers/volume11/ilin10a/ilin10a.pdf.
PCA can also be described as the maximum likelihood solution of a probabilistic latent variable model, which is known as PPCA:
\begin{equation*} Y_{ij} = \mathbf{P}_i^T \, \mathbf{Z}_j + m_i + \varepsilon_{ij} \quad i = 1, \dots, d; \; j = 1, \dots, n \end{equation*}
with
\begin{align} \mathbf{Z}_j &\sim N(0, \mathbb{1}) \\ \varepsilon_{ij} &\sim N(0, \nu) \end{align}
Both the principal components $\mathbf{Z}$ and the noise $\varepsilon$ are assumed normally distributed. The model is fitted by finding the maximum likelihood (ML) estimates of the parameters with the Expectation-Maximization (EM) algorithm, which amounts to minimizing the mean-square error over the observed part of the data. EM is a general framework for learning parameters from incomplete data that iterates between computing the expected complete-data log-likelihood and maximizing it with respect to the parameters. In PPCA, the data are incomplete because the principal components, $\mathbf{Z}_j$, are not observed and are treated as latent variables. When missing data are present, the E-step expectation of the complete-data log-likelihood is taken with respect to the conditional distribution of the latent variables given the observed values. The EM update rules are then the following.
E-step: \begin{align} \mathbf{\Sigma}_{\mathbf{Z}_j} = \nu \left(\nu \, \mathbb{1} + \sum_i \mathbf{P}_i \mathbf{P}_i^T \right)^{-1} \\ \overline{\mathbf{Z}}_j = \dfrac{1}{\nu}\mathbf{\Sigma}_{\mathbf{Z}_j} \sum_i \mathbf{P}_i \left(Y_{ij}- m_i \right) \\ m_{i} = \dfrac{1}{n} \sum_j \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j \right) \\ \end{align}
M-step: \begin{align} \mathbf{P}_{i} = \left( \sum_j \left( \overline{\mathbf{Z}}_j \overline{\mathbf{Z}}_j^T + \mathbf{\Sigma}_{\mathbf{Z}_j} \right) \right)^{-1} \sum_j \overline{\mathbf{Z}}_j \, \left(Y_{ij}- m_{i} \right)\\ \nu = \dfrac{1}{nd} \sum_{ij} \left[ \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j - m_i \right)^2 + \mathbf{P}_i^T \, \mathbf{\Sigma}_{\mathbf{Z}_j} \mathbf{P}_i \right] \end{align}
where each row of $\mathbf{P}$ and $\overline{\mathbf{Z}}$ is recomputed based only on those columns of $\overline{\mathbf{Z}}$ which contribute to the reconstruction of the observed values in the corresponding row of the data matrix.
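In the fully observed case the sums over $i$ and $j$ in the update rules collapse into matrix products, which makes the algorithm easy to sketch in NumPy. The implementation below is illustrative only, assuming complete data and a toy rank-2 model (the actual fit in this notebook is done by the `ppca` package imported earlier, which also handles the missing entries):

```python
import numpy as np

def ppca_em(Y, k, n_iter=200, seed=0):
    """EM for probabilistic PCA on fully observed data Y (d x n).

    Returns loadings P (d x k), posterior mean scores Zbar (k x n),
    mean m (d,) and noise variance nu.
    """
    d, n = Y.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(d, k))
    m = Y.mean(axis=1)
    nu = 1.0
    for _ in range(n_iter):
        # E-step: posterior covariance and mean of the latent scores
        Sz = nu * np.linalg.inv(nu * np.eye(k) + P.T @ P)    # Sigma_{Z_j}
        Zbar = (Sz @ P.T @ (Y - m[:, None])) / nu            # k x n
        m = (Y - P @ Zbar).mean(axis=1)
        # M-step: loadings and noise variance
        A = Zbar @ Zbar.T + n * Sz
        P = (Y - m[:, None]) @ Zbar.T @ np.linalg.inv(A)
        resid = Y - m[:, None] - P @ Zbar
        nu = (np.sum(resid**2) + n * np.trace(P @ Sz @ P.T)) / (n * d)
    return P, Zbar, m, nu

# Toy check: data generated from a rank-2 model with noise sd 0.1
rng = np.random.default_rng(1)
P_true = rng.normal(size=(10, 2))
Z_true = rng.normal(size=(2, 500))
Y = P_true @ Z_true + 0.1 * rng.normal(size=(10, 500))
P, Zbar, m, nu = ppca_em(Y, k=2)
```

On data like this, the estimated noise variance `nu` converges to a value close to the true 0.01 and the reconstruction `P @ Zbar + m` recovers the signal.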
X, var_exp = run_ppca(df_X, df_X_cols, min_obs = 0.8)
X.head()
First, we compute the explained variance as a function of the number of PC scores retained, to decide how many PCs we should keep. We decide to keep the PCs that together explain up to 80% of the variance (the black dashed line in the plot below), but this choice is somewhat arbitrary and might vary depending on the application.
plot_pc_var(var_exp, textsize = 15, title = 'PPCA explained variance ratio')
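The 80% cut-off can also be read off programmatically from the per-component explained-variance ratios. A minimal sketch with hypothetical ratios (the `var_ratio` values are made up for illustration, not the workshop's actual output):

```python
import numpy as np

# Hypothetical per-component explained-variance ratios
var_ratio = np.array([0.45, 0.20, 0.10, 0.08, 0.07, 0.05, 0.05])

# Smallest number of components whose cumulative ratio reaches 80%
cum = np.cumsum(var_ratio)
n_pcs = int(np.searchsorted(cum, 0.8)) + 1
print(n_pcs)  # here 4 components explain 83% of the variance
```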
To understand the relationship between the transformed variables (the PC scores) and the original variables, we can plot the 10 variables most highly correlated with each PC, as shown in the bar plot below, which reports the correlation for each of these variables. For example, we see that the first PC, which explains the largest share of the variance, is positively correlated with the density of owner-occupied housing units but negatively correlated with the density of renter-occupied housing units, as the maps below also show.
Note: depending on the random seed on your computer, the signs of the correlations might change.
plot_pc_corr(X, df_X_cols)
map0 = Map(
Layer(X,
encode_data=False,
style = color_bins_style('pc_0',
palette = 'SunsetDark',
stroke_width = 1,
opacity = 0.5),
legends = color_bins_legend(title='First Principal Component Score (PC0)',
description='',
footer ='ACS'),
popup_hover=[popup_element('pc_0', title='First Principal Component Score (PC0)')]
))
map1 = Map(
Layer(X,
encode_data=False,
style = color_bins_style('owner_occupied_housing_units_dens',
palette = 'SunsetDark',
stroke_width = 1,
opacity = 0.5),
legends = color_bins_legend(title='Density of Owner Occupied Housing Units',
description='',
footer ='ACS'),
popup_hover=[popup_element('owner_occupied_housing_units_dens', title='Density of Owner Occupied Housing Units')]
))
map2 = Map(
Layer(X,
encode_data=False,
style = color_bins_style('housing_units_renter_occupied_dens',
palette = 'SunsetDark',
stroke_width = 1,
opacity = 0.5),
legends = color_bins_legend(title='Density of Renter Occupied Housing Units',
description='',
footer ='ACS'),
popup_hover=[popup_element('housing_units_renter_occupied_dens', title='Density of Renter Occupied Housing Units')]
))
Layout([map0,map1,map2], 1,3,viewport={'zoom': 10, 'lat': 40.646891, 'lng': -73.869378})